Summary

The high reliability requirement of flight-crucial systems demands the use of a rigorous methodology for validation. In the $10^{-9}$ probability-of-failure regime, every potential area of failure must be considered. Traditional validation methods are inadequate, because they either require exorbitant lengths of time for testing or assume failure independency.

This paper presents a validation methodology for 
a fault-tolerant clock synchronization system utilizing 
formal design verification and experimental testing. 
The validation method relies on the formal proof pro- 
cess to uncover design and coding errors, and utilizes 
experimentation to validate the assumptions of the de- 
sign proof. The experimental method is presented and 
described in detail. To demonstrate the feasibility of 
the method, the clock synchronization algorithm for the 
Software Implemented Fault Tolerance (SIFT) system 
was implemented and validated in the Langley Avionics 
Integration Research Laboratory (AIRLAB). 

The design proof of the SIFT clock synchroniza- 
tion algorithm defines the maximum skew between any 
two clocks in the system in terms of theoretical up- 
per bounds on certain system parameters. These upper 
bounds are estimated as extremely large quantiles, so 
large that the probability of exceeding them is less than
$10^{-9}$. The quantile to which each parameter must be
estimated is determined by a combinatorial analysis of 
the system reliability. The parameters are measured by 
direct and indirect means, and upper bounds are esti- 
mated. A nonparametric method based on an asymp- 
totic property of the tail of a distribution is used to es- 
timate the upper bound of a critical system parameter. 
Finally, trade-offs between performance and reliability 
are discussed. 

Introduction 

Clock synchronization is an essential function in 
fault-tolerant multicomputer systems. Most fault- 
tolerant flight-control systems utilize exact-match vot- 
ing algorithms that depend critically upon the syn- 
chronization of the redundant computing elements. In 
fact, in many systems the entire communication mech- 
anism depends fundamentally on maintaining adequate 
synchronization between the replicated system clocks. 
Typically, a maximum clock skew is assumed and uti- 
lized as shown in figure 1. If the clock synchroniza- 
tion scheme fails, then system failure quickly follows. 
Clearly a fault-tolerant system is only as reliable as its 
synchronization subsystem. Therefore, any validation 
effort must include a careful analysis of the synchroniza- 
tion subsystem of a fault-tolerant computer system. 

The problem of validating the fault-tolerant clock synchronization algorithm used in the Software Implemented Fault Tolerance (SIFT) computer system, an experimental fault-tolerant computer system designed for flight-crucial applications, is discussed in this report. The weaknesses of classical validation methods are discussed, and a new method of validation that relies on a combination of formal design proof and experimental testing is introduced.

[Figure 1. Interprocess communication. $P_1$ sends data at time T (preagreed); $P_2$ reads after time $T + B + \delta$.]

We are grateful to Larry Lee for many helpful discus- 
sions on statistical theory and for his recommendation 
of the Weissman estimation method. 

Symbols 

$A_{qp}$    actual skew between clock q and clock p

$B$    maximum broadcast time

$C(t)$    clock value at real time t

$C^{(i)}$    function defining clock during ith synchronization interval

$E(\ )$    statistical expectation of a random variable

$e_{qp}$    difference between actual value of clock q and value read by processor p (i.e., read error)

$G_s$    Gini statistic

$k$    number of largest observations used in Weissman's statistical method

$m$    maximum number of faulty processors accommodated

$N$    number of processors in system

$p_h$    probability of one or more hardware faults on a specific processor during a mission

$P_{sys}$    probability of system failure during a mission

$p_\epsilon$    probability of obtaining a read error > $\epsilon$ during a single clock read

$p_1$    probability of estimate of maximum drift rate being too small

$p_2$    probability of one or more read errors > $\epsilon$ occurring on a specific processor during a mission

$R$    synchronization period

$R^{(i)}$    interval $[T^{(i)}, T^{(i+1)}]$

$r(T)$    real time when clock value is T

$S$    execution time of synchronization task

$T$    clock time

$T^{(i)}$    clock time at beginning of ith synchronization period

$t$    real time

$v$    a system design parameter approximately equal to mean communication delay

$W_s$    standardized form of Gini statistic

$X_{qp}$    communication delay when sending clock data from processor q to processor p

$\Delta_{qp}$    skew between processor p's clock and processor q's clock as perceived by processor p (i.e., actual skew + read error)

$\delta$    maximum clock skew

$\epsilon$    maximum error in reading another processor's clock

$\mu$    difference between $v$ and $E(X_{qp})$, equal to $E(e_{qp})$

$\rho$    maximum drift rate between any two clocks

$\sigma_\rho$    standard deviation of measurements of $\rho$

$\Sigma$    maximum correction factor

$\hat{\ }$    estimate of a parameter

Inadequacy of Classical Validation Methods 

Because of the criticality of synchronization systems, 
it is imperative that credible methods of validation be 
developed for these systems. However, severe require- 
ments for flight-crucial systems, such as a probability of 
failure not to exceed $10^{-9}$ for a 10-hour flight, preclude
the use of classical life testing as an assessment method. 
Furthermore, typical alternatives to life testing, such as 
combinatorial arguments or Markov models, are inad- 
equate because they assume failure independency. Al- 
though they are physically separated, clock failures are 
not independent, because each clock uses values from 
the other clocks in the system to remain synchronized 
with them. Because of this failure dependency, a fault- 
tolerant clock synchronization algorithm is used to pre- 
vent the propagation of a clock failure to another clock 
in the system. The validation process must establish 
the correctness of this algorithm in a system context. It 
must be demonstrated that a single faulty clock cannot 
compromise the system reliability. Thus, the following 
failure modes must be considered:

1. A majority of clocks fail before time T.

2. An error exists in the system design.

3. An error occurred in coding the synchronization algorithm.

4. Even though none of the above have occurred, the assumptions of the design have been exceeded.

[Figure 2. Impact of malicious clock on midvalue select algorithms. Labels: clock 3's value as perceived by processor 1; midvalue selected by processor 1; midvalue selected by processor 2; clock 3's value as perceived by processor 2. Thus, neither clock "corrects" itself, and they continue to drift apart.]

Combinatorial calculations can help only in estimating 
the probability of failure mode 1. Yet, there is often 
a far greater danger of system failure due to the other 
failure modes. Consider the classical 3-clock midvalue 
select algorithm. Although “intuitively” correct, this 
algorithm was proven to be faulty by SRI International. 
(See ref. 1.) A malicious clock which sends different 
values to different clocks can defeat the algorithm. (See 
fig. 2.) To avoid the possibility of failure mode 2, 
SRI International developed a new algorithm and a 
mathematical proof characterizing the performance of 
this algorithm in terms of certain system parameters. A 
mechanical verification of the proof (i.e., using formal 
software verification methods and automatic theorem 
provers), if performed, could virtually eliminate this 
failure mode. The use of code verification could also 
eliminate failure mode 3. However, the possibility of 
a mode-4 failure must be considered. Fortunately, the 
formal proof process encapsulates precisely the design 
assumptions in the form of a set of axioms. Thus, the 
following validation method is appropriate: 

1. Mathematically prove a theorem which character- 
izes the maximum clock skew permitted by the 
synchronization algorithm in terms of measurable 
system parameters. These parameters are defined 
through formal axioms. 

2. Mechanically verify that the implementation code 
correctly implements the algorithm. 
3. Experimentally verify the axioms required in the
design proof above. 

Although the SIFT synchronization code has not yet 
been mechanically checked, a mathematical design 
proof has been performed on the algorithm. The 
mechanical proof will be performed by SRI Interna- 
tional under NASA contract NAS1-17067 during 1984 
and 1985. In the following section, this algorithm and 
its proof are discussed. 
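
The failure of the 3-clock midvalue select algorithm described in this section can be illustrated with a small sketch. The function names and the numeric clock values below are hypothetical and chosen only for illustration; the sketch simply assumes that each good processor adopts the median of the three clock values it perceives.

```python
def midvalue(values):
    """Return the middle (median) value of three clock readings."""
    return sorted(values)[1]


def midvalue_select(clock1, clock2, faulty_seen_by_1, faulty_seen_by_2):
    """Values that good clocks 1 and 2 adopt after one midvalue-select step.

    The faulty clock 3 may report a different value to each processor.
    """
    new1 = midvalue([clock1, clock2, faulty_seen_by_1])
    new2 = midvalue([clock1, clock2, faulty_seen_by_2])
    return new1, new2


# Hypothetical readings: good clocks 1 and 2 are 10 units apart.
# Malicious clock 3 reports a low value to processor 1 and a high
# value to processor 2, so each processor selects its own clock.
new1, new2 = midvalue_select(100.0, 110.0, 90.0, 120.0)
print(new1, new2)
```

In this sketch each good processor finds its own clock in the middle, so no correction is applied and the two good clocks remain free to drift apart.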

SRI Clock Synchronization Algorithm 

To discuss the SRI clock synchronization algorithm 
properly, it is necessary to introduce a few definitions 
and some notation. The theory in this section was 
developed by SRI under the SIFT development contract 
NAS 1-15428 (see ref. 1). 

It is convenient to define a clock as a function from real time t to clock time T: $C(t) = T$. Real time is distinguished from clock time by the use of small letters for the former and capital letters for the latter. It is sometimes useful to use the inverse clock function $r(T) = C^{-1}(T) = t$. Using this inverse function, the concept of synchronization can be defined as follows:

Definition: Two clocks $r_p$ and $r_q$ are synchronized to within $\delta$ of each other at time T if

$$|r_p(T) - r_q(T)| < \delta$$

Next, a good clock is defined as follows:

Definition: A clock r is a good clock during the interval $[T_1, T_2]$ if it is a monotonic, differentiable function on $[T_1, T_2]$ and if there exists a $\rho$ such that for all T in $[T_1, T_2]$:

$$\left|\frac{dr}{dT}(T) - 1\right| < \frac{\rho}{2}$$

Thus, the drift rate of a good clock from real time is bounded by $\rho/2$ as illustrated in figure 3.

[Figure 3. Definition of a good clock. The bounding line is labeled $(1 + \rho/2)T$.]

A clock synchronization algorithm periodically resets the clocks in the system. This process may be viewed as redefining the mathematical clock function:

$$r^*(T) = r(T - \Delta)$$

or equivalently

$$C^*(t) = C(t) + \Delta$$

Here, the new clock $C^*$ is obtained by incrementing clock C by $\Delta$ seconds. As the processors synchronize clocks every R seconds, the time base of each processor is a sequence of redefined clock functions. Using $T^{(i)}$ as the clock time at the beginning of the ith interval, $T^{(i)} = T^{(0)} + iR$, and $R^{(i)} = [T^{(i)}, T^{(i+1)}]$. For each such interval there is a new clock definition as follows:

$$C^{(i+1)}(t) = C^{(i)}(t) + \Delta^{(i)}$$

where $\Delta^{(i)}$ is the ith clock correction.

The clock synchronization algorithm requires that each processor exchange clock values with every other processor during the subinterval

$$S^{(i)} = [T^{(i+1)} - S,\ T^{(i+1)}]$$

which is the last S seconds of the interval $R^{(i)}$. Since this clock value exchange is subject to error, it is necessary to introduce a notation and an axiom which characterize this error:

Axiom: If processors p and q are nonfaulty and their clocks are synchronized up to time $T^{(i+1)}$, then p obtains a value $\Delta_{qp}$ during the interval $S^{(i)}$, such that

$$|r_p^{(i)}(T_0 + \Delta_{qp}) - r_q^{(i)}(T_0)| < \epsilon$$

for some time $T_0$ in $S^{(i)}$. Thus, the error in reading another processor's clock is bounded by $\epsilon$.

The SIFT synchronization algorithm is as follows:

Algorithm: For all clocks p,

$$C_p^{(i+1)} = C_p^{(i)} + \Delta_p$$

where

$$\Delta_p = (1/N)\sum_{r=1}^{N} \bar{\Delta}_{rp}$$

If $r \ne p$ and $|\Delta_{rp}| < \Sigma$, then

$$\bar{\Delta}_{rp} = \Delta_{rp}$$

else

$$\bar{\Delta}_{rp} = 0$$

where

$$\Sigma \approx \delta + \epsilon$$
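
The correction step of the algorithm can be sketched as follows. This is only an illustration under stated assumptions: the function name, the list layout (one perceived skew per clock, indexed by processor number), and the numeric values are hypothetical, and the threshold argument stands for the bound of roughly $\delta + \epsilon$ used by the algorithm.

```python
def clock_correction(perceived_skews, p, threshold):
    """Correction applied by processor p in one synchronization period.

    perceived_skews[r] is the skew of clock r as perceived by processor p.
    Processor p's own entry, and any skew whose magnitude is not below the
    threshold, contributes 0; the sum is still divided by the total
    number of clocks N.
    """
    n = len(perceived_skews)
    total = 0.0
    for r, skew in enumerate(perceived_skews):
        if r != p and abs(skew) < threshold:
            total += skew
    return total / n


# Hypothetical data for processor 0 in a 4-clock system (seconds).
# Clock 3 looks wildly off, so its reading is replaced by 0.
skews_seen_by_0 = [0.0, 2e-5, -1e-5, 0.5]
correction = clock_correction(skews_seen_by_0, p=0, threshold=1e-4)
print(correction)
```

Dividing by N rather than by the number of accepted readings is what bounds the influence of any single clock on the correction.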


The following theorem was proved by Leslie Lamport and Michael Melliar-Smith of SRI International:

Theorem: If

$$3m < N,$$

$$\delta \ge \frac{N}{N - 3m}\left\{2\epsilon + \rho\left[R + 2\left(\frac{N - m}{N}\right)S\right]\right\},$$

$$\delta \ge \delta_0 + \rho R,$$

and

$$\delta \ll \epsilon/\rho$$

and if no more than m processes are faulty up to time $T^{(i+1)}$, then for all clocks p and q:

1. If processes p and q are nonfaulty up to time $T^{(i+1)}$, then for all values of T in $R^{(i)}$:

$$|r_p^{(i)}(T) - r_q^{(i)}(T)| < \delta$$

2. If process p is nonfaulty up to time $T^{(i+1)}$, then

$$|r_p^{(i+1)}(T) - r_p^{(i)}(T)| < \Sigma$$

where

$\epsilon$    maximum error in reading another processor's clock

$\rho$    maximum drift rate between any two clocks

$m$    maximum number of faulty clocks accommodated

$N$    number of clocks in system

$R$    synchronization period

$S$    execution time of synchronization task

$\Delta_{qp}$    skew between processor p's clock and processor q's clock as perceived by processor p (i.e., actual skew + read error)
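
As a rough illustration of how the theorem ties the maximum skew to the measured parameters, the sketch below evaluates the smallest $\delta$ satisfying the two lower bounds. It assumes the bounds read $\delta \ge \frac{N}{N-3m}\{2\epsilon + \rho[R + 2(\frac{N-m}{N})S]\}$ and $\delta \ge \delta_0 + \rho R$; the function name and all numeric values are hypothetical and are not measurements from the AIRLAB system.

```python
def minimum_max_skew(n, m, eps, rho, r, s, delta0):
    """Smallest delta satisfying both assumed lower bounds of the theorem.

    n: number of clocks, m: faulty clocks tolerated, eps: read-error bound,
    rho: drift-rate bound, r: synchronization period, s: execution time of
    the synchronization task, delta0: initial synchronization bound.
    """
    if 3 * m >= n:
        raise ValueError("the theorem requires 3m < N")
    bound_from_errors = (n / (n - 3 * m)) * (
        2 * eps + rho * (r + 2 * ((n - m) / n) * s)
    )
    bound_from_start = delta0 + rho * r
    return max(bound_from_errors, bound_from_start)


# Hypothetical parameter values (seconds, and seconds per second for rho).
delta = minimum_max_skew(n=4, m=1, eps=1e-4, rho=1e-6, r=1.0, s=0.1,
                         delta0=1e-4)
print(delta)
```

With these illustrative values the read-error term dominates, which is consistent with the emphasis the report places on estimating $\epsilon$.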


Assumptions of Design Proof 

The design proof effectively establishes the correct- 
ness of the algorithm, assuming a set of axioms is cor- 
rect. Many of these axioms are well-established mathe- 
matical theorems. Other axioms define the behavior of 
the computer system on which the algorithm executes. 
For example, the SRI design proof assumes that every processor can read another processor's clock to within an error of $\epsilon$. The correctness of such assumptions must be established by experimentation. The following is a list of the system behavior axioms which are assumed in the SRI design proof:

1. The maximum drift rate between any two working clocks is < $\rho$.

2. If two processors are nonfaulty, then one processor can read another processor's clock to within an error of $\epsilon$.

3. The clocks of the system are initially synchronized to within $\delta_0$.

4. The system executes the algorithm every R seconds 
and provides enough CPU time for the algorithm to 
complete. 

Each of these assumptions must be established to a 
confidence level consistent with the reliability require- 
ments. Thus, although life testing of the system as a 
whole can be avoided, life testing must effectively be 
performed on certain system parameters. However, the 
behavior of these low-level components of the system
is typically far less complex than that of the system as a
whole, and the components are thus easier to validate. 
Furthermore, the formal proof provides a precise state- 
ment of exactly which properties of the system must be 
measured. 

Measurement of System Parameters 

The mathematical theorem defines the worst-case 
clock skew in terms of certain system parameters. These 
parameters are specific to each implementation of the 
algorithm. The SIFT system was originally imple- 
mented on seven Bendix BDX-930 processors. How- 
ever, extracting synchronization data from that system 
is currently difficult. To investigate methods of vali- 
dation, the SIFT synchronization algorithm was imple- 
mented on four VAX-11/750 computers in AIRLAB. 
Some of the system parameters are directly measurable on the VAX system. Others, such as the maximum read error $\epsilon$, must be indirectly measured. The results of these measurements are analyzed in a subsequent section, after the explanation of the theoretical method.

The first parameter to be discussed is $\epsilon$, because it is the most difficult to measure and plays a significant role in the system performance. As defined previously, $\epsilon$ is an upper bound on $|e_{qp}|$ over all processor pairs and over all time, where $e_{qp} = r_p^{(i)}(T_0 + \Delta_{qp}) - r_q^{(i)}(T_0)$. Unfortunately, $e_{qp}$ is defined in terms of real time rather than observable clock time. In the appendix, the following formula defining $e_{qp}$ in terms of observable clock times is shown to be a highly accurate approximation of the theoretical $e_{qp}$:

$$A_{qp} + e_{qp} = \Delta_{qp}$$

and

$$|e_{qp}| \le \epsilon$$

where $A_{qp}$ is the difference between clocks p and q at real time t (i.e., actual skew at t):

$$A_{qp}(t) = C_p(t) - C_q(t)$$

It is still necessary to characterize $e_{qp}$ in a system context. In the system, a processor p reads a processor q's clock by the following method: At a prespecified time, processor q reads its clock and transmits the value $C_q(t_1)$ to processor p. Upon receiving the message, processor p immediately reads its clock to obtain $C_p(t_2)$. As shown in figure 4, if the exact communication delay $X_{qp}$ were known, then the exact skew $A_{qp}$ could be calculated by

$$A_{qp} = C_p(t_2) - C_q(t_1) - X_{qp}$$

Because the exact delay is not known, the designer of the synchronization system chooses a value $v$ approximately equal to $E(X_{qp})$ to be used by the system to compute an apparent skew $\Delta_{qp}$ by the following formula:

$$\Delta_{qp} = C_p(t_2) - C_q(t_1) - v$$

Because the communication delay is variable, each calculation of $\Delta_{qp}$ is subject to an error of $X_{qp} - v$. There are two components to this error, and they are shown in the following equation:

$$e_{qp} = X_{qp} - v = [X_{qp} - E(X_{qp})] + \mu$$

where

$$\mu = E(X_{qp}) - v$$

The first component $[X_{qp} - E(X_{qp})]$ is the variation due to the random nature of the communication. The second component $\mu$ is a constant offset due to the system designer's error in choosing $v$. Also, it follows from the above formula that $E(e_{qp}) = \mu$.
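
The relation between actual skew, apparent skew, and read error can be checked numerically. The following is a minimal sketch with hypothetical function names and made-up clock readings; it only restates the formulas of this section, with $v$ as the designer's assumed delay and `x_qp` as the true (normally unknown) delay.

```python
def actual_skew(cp_t2, cq_t1, x_qp):
    """Exact skew, computable only if the true delay x_qp were known."""
    return cp_t2 - cq_t1 - x_qp


def apparent_skew(cp_t2, cq_t1, v):
    """Skew computed by the system using the assumed delay v."""
    return cp_t2 - cq_t1 - v


def read_error(x_qp, v):
    """Error introduced by using v in place of the true delay."""
    return x_qp - v


# Hypothetical readings in seconds.
cq_t1, cp_t2 = 10.000000, 10.000350
x_qp, v = 0.000300, 0.000280

a = actual_skew(cp_t2, cq_t1, x_qp)
d = apparent_skew(cp_t2, cq_t1, v)
e = read_error(x_qp, v)
print(a, d, e)
```

The apparent skew exceeds the actual skew by exactly the read error, which is the identity used throughout the measurement method.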

The distribution of e qp may be obtained from mea- 
surements of the one-way communication delays. How- 
ever, this requires some form of special hardware. In 
the AIRLAB VAX system, a special Pulse Network was 
used to measure this delay. The delay for sending a 
pulse is considerably less variable than the communica- 
tion delay for sending a message. Thus, reasonably ac- 
curate measurements of the communication delay were 
made by the following method: One processor’s clock 
was read, and the value was sent to a second proces- 
sor. When the second processor received the message, 
a pulse was immediately sent over the Pulse Network 
to the first processor. When the first processor received 
the pulse, its clock was read again. By subtracting 
the first clock value from the second clock value, the communication delay plus the pulse delay was measured. Subtracting an estimate of the mean pulse delay from this value provides an accurate measurement of the communication delay. Figure 5 is a histogram of 2000 such estimates of $X_{qp}$.

[Figure 4. Calculation of actual skew $A_{qp}$. Clocks are read at real times $t_1$ and $t_2$; $X_{qp} = C_p(t_2) - C_p(t_1)$; $A_{qp} = C_p(t_1) - C_q(t_1) = C_p(t_2) - C_q(t_1) - X_{qp}$.]

Next, a method is discussed that provides a means of estimating both $\epsilon$ and $\rho$ using the internal state information of the synchronization system. Thus, the method requires no special external measuring hardware. The physics of crystal clocks dictates that the drift rate $\rho_{qp}$ between any two clocks q and p is constant over time. Thus, if the system is run without synchronization, the model

$$\Delta_{qp}(T) = \delta_{qp}(0) + \rho_{qp} T + e_{qp}(T)$$

describes the system, where $\delta_{qp}(0)$ is the initial skew between clocks q and p at time 0, and $\rho_{qp}$ is the drift rate between clocks q and p. The values of $\Delta_{qp}(T)$ are directly observable from the various processors' memories. Since the $\Delta_{qp}$'s are computed every R seconds, the data consist of $\Delta_{qp}(T^{(i)})$, where $T^{(i)} = T^{(0)} + iR$ and $i = 1, \ldots, n$. Thus, a linear least-squares analysis can be used to estimate the parameters $\delta_{qp}(0)$ and $\rho_{qp}$. (See fig. 6.) The residuals from the regression can be represented by $\phi$, and the equation can be written as:

$$\Delta_{qp}(T) = \alpha + \beta T + \phi$$

If the $e_{qp}$'s are distributed with mean zero, then $\beta$ is an estimate of $\rho_{qp}$, and the residuals have approximately the same distribution as the $e_{qp}$'s. However, this may not be the case. Suppose that $E(e_{qp}) = \mu \ne 0$. Then, $\alpha = \delta_{qp}(0) + \mu$, and the residuals have approximately the same distribution as $e_{qp} - \mu$. Thus, a histogram of $e_{qp}$'s can be obtained by adding $\mu$ to the residuals.
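
The regression step can be sketched with ordinary least squares written out by hand. The function name and the synthetic, noise-free data below are hypothetical; the returned standard error of the slope is the usual textbook formula and is included because a later section needs a variance estimate for each drift rate.

```python
def fit_drift(times, skews):
    """Least-squares fit of skew = intercept + slope * time.

    Returns (intercept, slope, slope_std_error, residuals). The slope
    estimates the drift rate; the residuals approximate the read errors
    up to their mean.
    """
    n = len(times)
    t_mean = sum(times) / n
    s_mean = sum(skews) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (s - s_mean) for t, s in zip(times, skews))
    slope = sxy / sxx
    intercept = s_mean - slope * t_mean
    residuals = [s - (intercept + slope * t) for t, s in zip(times, skews)]
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sigma2 = sum(e * e for e in residuals) / (n - 2)
    slope_std_error = (sigma2 / sxx) ** 0.5
    return intercept, slope, slope_std_error, residuals


# Synthetic example: initial skew 1e-4 s, drift rate 2e-6 s/s, no noise.
times = [float(i) for i in range(1, 11)]
skews = [1e-4 + 2e-6 * t for t in times]
intercept, slope, slope_se, residuals = fit_drift(times, skews)
print(intercept, slope, slope_se)
```

With real data the residuals would be collected into a histogram and shifted by the estimated mean read error, as described above.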

Next, a simple independent method to estimate $\mu$ is presented. Since $\Delta_{qp} = A_{qp} + e_{qp}$,

$$\Delta_{qp} + \Delta_{pq} = e_{qp} + e_{pq} + A_{qp} + A_{pq}$$

Furthermore, since $A_{qp} = -A_{pq}$,

$$\Delta_{qp} + \Delta_{pq} = e_{qp} + e_{pq}$$

Thus,

$$E(e_{qp}) = E\left[\frac{\Delta_{qp} + \Delta_{pq}}{2}\right]$$

since $E(e_{qp}) = E(e_{pq})$. Therefore, $\mu$ can be estimated by the sample average of several observations of $(\Delta_{qp} + \Delta_{pq})/2$.

In a properly "tuned" synchronization system, $\mu = 0$. When $\mu = 0$ in a system, the absolute value of the read error is minimized. The following method may be used to tune the synchronization system. As shown earlier, the $e_{qp}$'s consist of two components: the deviation from the mean communication time, and the constant offset $\mu$. Since $\mu$ can be expressed as follows:

$$\mu = E(X_{qp}) - v = E(X_{qp} - v) = E(e_{qp})$$

the constant offset can be eliminated by simply adding the above estimate of $\mu$ to $v$. The code in the implementation may thus be corrected to use this new value of $v$.
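
The estimate of the constant offset and the tuning step can be sketched as follows. The function names and the sample values are hypothetical; the sketch assumes paired observations of the skew perceived in each direction between two processors.

```python
from statistics import mean


def estimate_offset(skews_qp, skews_pq):
    """Sample average of (skew_qp + skew_pq) / 2 over paired observations.

    The actual skews cancel in the sum, leaving the average read error.
    """
    return mean([(a + b) / 2 for a, b in zip(skews_qp, skews_pq)])


def tuned_delay(v, offset_estimate):
    """New value of the assumed delay after removing the constant offset."""
    return v + offset_estimate


# Hypothetical data: actual skew 5e-5 s, read errors of a few tens of
# microseconds in each direction.
skews_qp = [8e-5, 6e-5]    # +5e-5 plus errors 3e-5 and 1e-5
skews_pq = [-3e-5, -3e-5]  # -5e-5 plus errors 2e-5 and 2e-5
offset = estimate_offset(skews_qp, skews_pq)
new_v = tuned_delay(2.8e-4, offset)
print(offset, new_v)
```

Here the estimate equals the average of the four read errors, even though the actual skew never had to be known.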

Next, an upper bound on $|e_{qp}|$ must be estimated from $e_{qp}$ histograms obtained from either method. A statistical approach to this problem is discussed in the following section.

A Statistical Approach to Estimating $\epsilon$

In this section, the problem of estimating an upper bound of $|e_{qp}|$ from histograms of experimental data is addressed. The motivation for this exercise is to statistically estimate the parameter $\epsilon$ with confidence consistent with the reliability requirement of a probability of failure not to exceed $10^{-9}$ for a 10-hour flight. The quantile of the $e_{qp}$ distribution to which $\epsilon$ must be equated is specific to each system and may be determined by a reliability analysis of the system. This quantile is usually on the order of $1 - 10^{-9}$. Unfortunately, the traditional method of estimating a quantile requires an exorbitant sample size in this case, as shown in the following paragraph.

Let $x_1, x_2, x_3, \ldots, x_n$ be a random sample from a distribution F(x). The pth sample quantile is the number $\xi_p$, such that the fraction of $x_i$'s that are less than $\xi_p$ is $\le p$, and the fraction of $x_i$'s greater than $\xi_p$ is $\le 1 - p$. Thus, to estimate the theoretical $\xi_p$ quantile by this technique, one must observe at least one $x_i$ greater than the $\xi_p$ quantile. This requires an extremely large sample size as may be seen by the following calculation. Let $Z = \max(x_1, x_2, x_3, \ldots, x_n)$. Then,

$$\begin{aligned}
\mathrm{Prob}(Z > \xi_p) &= 1 - \mathrm{Prob}(Z \le \xi_p) \\
&= 1 - \mathrm{Prob}(x_1 \le \xi_p, x_2 \le \xi_p, \ldots, x_n \le \xi_p) \\
&= 1 - \prod_{i=1}^{n} \mathrm{Prob}(x_i \le \xi_p) \\
&= 1 - p^n
\end{aligned}$$

Thus, to be $(1 - \alpha) \times 100$ percent confident that an observation will exceed $\xi_p$, n must be chosen such that

$$1 - p^n = 1 - \alpha$$

or

$$n = \ln(\alpha)/\ln(p)$$

For $\alpha = 0.75$ (i.e., only 25-percent confidence) and $p = 1 - 10^{-9}$, $n = 2.876 \times 10^8$ observations. Such a large sample size is usually impractical.
The only remaining choice is to assume an underlying 
parametric model of the distribution X or to assume 
some special properties of the distribution. Sometimes 
a parametric model can be theoretically derived from an 
analysis of the communication system. Often, however, 
such a model is not obtainable. Furthermore, any sta- 
tistical inference made would be strongly dependent on 
the assumption that the experimental data were gener- 
ated from the chosen parametric family of distributions. 
The danger inherent in such a method can be seen in 
figure 7. In this figure, an attempt to fit a random 
sample of X qp to a 3-parameter Weibull distribution is 
illustrated. (See ref. 4.) Although this is a very large 
family of distributions, an unfortunate situation is observed. The estimated $1 - 10^{-9}$ quantile is smaller than
several observed data values! (See fig. 7.) Even when
the fit is statistically very good, the model may prove 
inappropriate for inference with respect to the tail of the 
distribution. Clearly, some other method of estimating 
properties of the tail of a distribution is needed. 
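
The sample-size calculation given earlier in this section, n = ln(α)/ln(p), is easy to reproduce. The sketch below uses a hypothetical function name and takes the tail probability 1 − p as its argument; evaluating ln(p) with `math.log1p` is a numerical convenience assumed here, not something prescribed by the report.

```python
import math


def required_sample_size(alpha, tail_probability):
    """Observations needed so that Prob(max exceeds the quantile) = 1 - alpha.

    The quantile of interest is p = 1 - tail_probability, and
    n = ln(alpha) / ln(p).
    """
    return math.log(alpha) / math.log1p(-tail_probability)


n_obs = required_sample_size(alpha=0.75, tail_probability=1e-9)
print(f"{n_obs:.4g}")
```

Demanding higher confidence (smaller α) only increases the required number of observations further.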

Although the estimation of the maximum was in- 
tractable when no assumptions were made about the 
underlying distribution, certain minimal assumptions 
can simplify the estimation problem considerably. The 
statistical theory presented next was developed by 
Weissman. (See ref. 2.) This theory enables the esti- 
mation of large quantiles of the underlying population 
distribution from the k largest observations of a random 
sample. The major assumption of the method is that 
the underlying distribution function F is in the “do- 
main of attraction” of some known distribution func- 
tion G . This property is satisfied by a large class of 
distribution functions (e.g., the gamma, exponential, 
Weibull, normal, lognormal, and logistic). Therefore, 
the method is essentially nonparametric. Mathematically this assumption is as follows: If $x_1, x_2, x_3, \ldots, x_n$ is a random sample from a distribution F(x) and $Z_n$ is the largest observed x, then the distribution function for $Z_n$ is $F^n(x)$. If there exist sequences $a_n > 0$ and $b_n$ for all n and a distribution function G such that

$$F^n(a_n x + b_n) \to G(x) \quad \text{as } n \to \infty$$

for all values of x where G is continuous, then G(x) is an extremal distribution function and F(x) lies in its "domain of attraction." (See ref. 3.) This distribution function G must be from one of the following families of distributions:

1. $\mathrm{LAMBDA}(x) = \exp(-e^{-x})$    $(-\infty < x < \infty)$

2. $\mathrm{PHI}_\alpha(x) = \exp(-x^{-\alpha})$    $(x > 0,\ \alpha > 0)$

3. $\mathrm{PSI}_\alpha(x) = \exp(-(-x)^{\alpha})$    $(x < 0,\ \alpha > 0)$

[Figure 7. Weibull fit to communication delay histogram.]

The theory developed by Weissman makes possible the estimation of large quantiles of the underlying distribution. His method is based on the result that the k largest order statistics, when normalized by constants $a_n$ and $b_n$, converge in distribution to a k-dimensional extremal variate (a vector of variables distributed as the ordered times at which events occur in a nonhomogeneous Poisson process). The normalizing constants are treated as unknown parameters indexing the limiting distribution and are estimated by the method of maximum likelihood. That is, the estimation problem is solved by basing the estimates on the limiting distribution rather than on the underlying distribution of the $e_{qp}$. As indicated subsequently, the quantiles of the underlying distribution may be represented in terms of the parameters $a_n$ and $b_n$, thereby obtaining maximum likelihood estimates of the quantiles of the underlying distribution.

For example, by choosing $\epsilon$ to be the $1 - 10^{-9}$ quantile, the probability of the system exceeding the design assumption is $10^{-9}$. The method provides a simple means of calculating the large quantiles, once the limiting distribution family has been determined (i.e., one of the three listed previously in this section). Fortunately, simple statistical tests are available for making this determination. These are discussed subsequently. The Weissman technique only uses the largest k values of the random sample. The choice of k is arbitrary, although it should be small in comparison with the sample size (e.g., k = 10 and n = 1000). The value of k is chosen prior to the examination of the data.

Once k is chosen and the limiting distribution is determined, one of the following calculations is performed depending on the limiting distribution. In each case below $X_{1n} > X_{2n} > \cdots > X_{kn} > \cdots > X_{nn}$ represents the order statistics of the random sample.

CASE 1 (G = LAMBDA):

$$(1 - c/n)\ \text{quantile} = a_n[-\ln(c)] + b_n$$

where

$$a_n = \left[\sum_{i=1}^{k} X_{in}/k\right] - X_{kn}$$

and

$$b_n = a_n \ln(k) + X_{kn}$$

CASE 2 (G = PHI):

$$(1 - c/n)\ \text{quantile} = (k/c)^{1/\alpha} X_{kn}$$

where

$$1/\alpha = \left[\sum_{i=1}^{k} \ln(X_{in})/k\right] - \ln(X_{kn})$$

CASE 3 (G = PSI):

Since case 3 applies only to negative x, it is not appropriate for this application and is not discussed in detail here.
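
The two applicable cases can be sketched in a few lines. The function names and the toy sample are hypothetical, and the sketch assumes the case 1 scale estimate is the mean of the k largest observations minus the kth largest, with the case 2 exponent formed the same way from logarithms. To target a tail probability q with a sample of size n, c is set to n times q.

```python
import math


def top_k(sample, k):
    """The k largest observations in descending order."""
    return sorted(sample, reverse=True)[:k]


def quantile_lambda(sample, k, c):
    """Case 1 (G = LAMBDA): estimate of the 1 - c/n quantile."""
    top = top_k(sample, k)
    x_kn = top[-1]
    a_n = sum(top) / k - x_kn
    b_n = a_n * math.log(k) + x_kn
    return a_n * (-math.log(c)) + b_n


def quantile_phi(sample, k, c):
    """Case 2 (G = PHI): estimate of the 1 - c/n quantile (positive data)."""
    top = top_k(sample, k)
    x_kn = top[-1]
    inv_alpha = sum(math.log(x) for x in top) / k - math.log(x_kn)
    return (k / c) ** inv_alpha * x_kn


# Toy sample of 100 values; a real analysis would use measured delays.
sample = [float(i) for i in range(1, 101)]
n, k = len(sample), 10
c = n * 1e-9  # c for the 1 - 1e-9 quantile
print(quantile_lambda(sample, k, c), quantile_phi(sample, k, c))
```

A useful sanity check on both formulas is that setting c = k returns the kth largest observation itself, since that observation is the empirical 1 − k/n quantile.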

The remaining problem is the determination of the limiting distribution G. It is possible to test the hypothesis that G = LAMBDA by testing whether the set of normalized spacings $D_{1n}, 2D_{2n}, \ldots, (k-1)D_{(k-1)n}$ are independent, identically distributed exponential random variables, with $D_{in} = X_{in} - X_{(i+1)n}$. Similarly, the hypothesis G = PHI can be tested by determining whether the normalized spacings of $\ln(X_{in})$ are independent, identically distributed exponential random variables. The Gini statistic can be used for these tests. (See ref. 4.) The Gini statistic $G_s$ is calculated as follows:

$$G_s = \sum_{i=1}^{s}\sum_{j=1}^{s} |y_i - y_j| \,/\, [2s(s-1)\bar{y}]$$

where $y_1, y_2, \ldots, y_s$ is the random sample being tested for exponentiality. The statistic $G_s$ necessarily lies between 0 and 1, with values near 0 or 1 indicating nonexponentiality. For values of s larger than 20, the standardized form of the Gini statistic,

$$W_s = [12(s-1)]^{1/2}(G_s - 0.5) \approx \text{the Standard Normal, } N(0, 1)$$

may be used to determine the significance level. Thus, for example, a $W_s$ value of 2.96 indicates an observed significance level of 1 percent, and permits the rejection of the hypothesis that the distribution is exponential with 99-percent confidence.
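
The test statistic can be computed directly from its definition. The following sketch uses hypothetical function names; it assumes the Gini statistic is the double sum of absolute differences divided by 2s(s − 1) times the sample mean, and that the normalized spacings are i times the gap between the ith and (i + 1)th largest observations.

```python
import math


def normalized_spacings(top_values):
    """i * (X_i - X_{i+1}) for the k largest values, given in descending order."""
    return [
        (i + 1) * (top_values[i] - top_values[i + 1])
        for i in range(len(top_values) - 1)
    ]


def gini_statistic(y):
    """Gini statistic of a positive sample; lies between 0 and 1."""
    s = len(y)
    y_mean = sum(y) / s
    total = sum(abs(a - b) for a in y for b in y)
    return total / (2 * s * (s - 1) * y_mean)


def standardized_gini(y):
    """Approximately N(0, 1) under exponentiality when s is larger than 20."""
    s = len(y)
    return math.sqrt(12 * (s - 1)) * (gini_statistic(y) - 0.5)


spacings = normalized_spacings([100.0, 99.0, 98.0, 97.0])
print(spacings, gini_statistic(spacings))
```

The double sum costs on the order of s squared operations, which is negligible for the small values of k used by the method.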
A Statistical Approach to Estimating $\rho$

In this section, a method is presented for estimating $\rho$, the maximum drift rate between any two clocks in the system, such that if $\rho_{qp}$ represents the drift rate between clocks q and p, then

$$\rho_{qp} < \rho$$

for all clocks q and p. This is precisely design assumption 1 discussed previously. To determine the probability that this design assumption is violated, it is necessary to calculate the probability that one or more $\rho_{qp}$ exceed the estimated upper bound $\hat{\rho}$, or

$$\mathrm{Prob}(\rho_{qp} > \hat{\rho}) \quad \text{for some q and p}$$

If there are n processors in the system being validated, then there are $n_c = \binom{n}{2}$ drift rates between processor pairs. For simplicity, these are referred to as $\rho_i$, where i = 1 to $n_c$.

Using the linear regression analysis on the $\Delta_{qp}(T^{(i)})$ data described previously, a set of estimates

$$\{(\hat{\rho}_i, \hat{\sigma}_i) \mid i = 1, n_c\}$$

can be obtained, where $\hat{\rho}_i$ is an estimate of the drift rate between processor pair i and $\hat{\sigma}_i^2$ is an estimate of the variance of $\hat{\rho}_i$.

From the experimental data, an estimate $\hat{\rho}$ of the upper bound of the drift rates must be determined such that $\mathrm{Prob}(\max(\rho_i) > \hat{\rho}) = \alpha$ is sufficiently small.

The maximum drift rate $\rho$ may be estimated as follows:

$$\hat{\rho} = \max(u_i)$$

where $u_i$ is defined by:

$$\mathrm{Prob}(\rho_i < u_i) = \sqrt[n_c]{1 - \alpha}$$

The following theorem shows that this estimate is adequate:

Theorem: $\mathrm{Prob}(\max(\rho_i) < \hat{\rho}) \ge 1 - \alpha$.

Proof:

$$\begin{aligned}
\mathrm{Prob}(\max(\rho_i) < \hat{\rho}) &= \mathrm{Prob}(\max(\rho_i) < \max(u_i)) \\
&= \mathrm{Prob}(\rho_1 < \max(u_i), \rho_2 < \max(u_i), \ldots, \rho_{n_c} < \max(u_i)) \\
&\ge \mathrm{Prob}(\rho_1 < u_1, \rho_2 < u_2, \ldots, \rho_{n_c} < u_{n_c}) \\
&= \prod_{i=1}^{n_c} \sqrt[n_c]{1 - \alpha} \\
&= 1 - \alpha
\end{aligned}$$

The values of $u_i$ are easily obtained by using the following formula:

$$u_i = \hat{\rho}_i + t(\nu, \theta)\,\hat{\sigma}_i$$

where

$\nu = n_s - 2$

$n_s$ = number of data points used in regression analysis to obtain $\hat{\rho}_i$ and $\hat{\sigma}_i$

$\theta = \sqrt[n_c]{1 - \alpha}$

$t(\nu, \theta)$ = $\theta$ percentage point of Student's t distribution with $\nu$ degrees of freedom

For $n_s > 100$, $t(\nu, \theta)$ may be replaced by a percentage point of the standard normal distribution. The following approximation formula for the normal distribution F(z) (see ref. 5) is useful for small values of $\alpha$ (i.e., large z):

$$1 - \alpha = F(z) \approx 1 - \frac{1}{z\sqrt{2\pi}}\exp(-0.5z^2)$$
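
The bound on the drift rate can be sketched using the large-sample normal replacement for the t percentage point. The function names and the numeric estimates below are hypothetical; `statistics.NormalDist` from the Python standard library supplies the normal percentage point, and the sketch assumes each pair's bound is taken at the level $(1 - \alpha)^{1/n_c}$.

```python
import math
from statistics import NormalDist


def drift_upper_bound(estimates, alpha):
    """Upper bound on the maximum drift rate over all processor pairs.

    estimates: list of (drift_estimate, std_error) pairs, one per processor
    pair, each from a regression with more than about 100 points so the
    normal percentage point can replace the t percentage point.
    """
    n_c = len(estimates)
    theta = (1.0 - alpha) ** (1.0 / n_c)
    z = NormalDist().inv_cdf(theta)
    return max(rate + z * err for rate, err in estimates)


def normal_tail_approx(z):
    """Large-z approximation to 1 - F(z) for the standard normal."""
    return math.exp(-0.5 * z * z) / (z * math.sqrt(2.0 * math.pi))


# Hypothetical drift estimates and standard errors for three pairs.
pairs = [(1.0e-6, 1e-8), (1.2e-6, 2e-8), (0.8e-6, 1e-8)]
bound = drift_upper_bound(pairs, alpha=0.05)
print(bound, normal_tail_approx(5.0))
```

For very small α the inverse normal is evaluated close to 1, where the tail approximation above gives a convenient cross-check on the percentage point.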


Validation of the AIRLAB Experimental 
System 

In this section, the methods developed in the pre- 
vious sections are combined into a complete validation 
method. This validation method is illustrated by ap- 
plication to measurements made on the AIRLAB ex- 
perimental synchronization system. As described previ- 
ously, this system consists of four communicating VAX- 
11/750 computers. These processors exchange clock 
values and synchronize themselves using the SIFT fault- 
tolerant clock synchronization algorithm. As discussed 
previously, the four major design assumptions which 
must be validated are as follows: 

1. The maximum drift rate between any two working clocks is < $\rho$.

2. If two processors are nonfaulty then one processor can read another processor's clock to within an error of $\epsilon$.

3. The clocks of the system are initially synchronized to within $\delta_0$.

4. The system executes the algorithm every R seconds 
and provides enough CPU time for the algorithm to 
complete. 

Design assumption 3 corresponds to a process that would occur at system initialization. Since this process occurs before system operation, while the aircraft is on the ground, it need not be fault tolerant. If the initialization process fails, it can be restarted. Detection of such a failure is not difficult. Design assumption 4 is intimately connected with the performance characteristics of the test-specimen operating-system scheduler. Analysis of the operating-system scheduler is strongly dependent on the scheduling method employed in the system. Therefore, it is not possible to present a generic method for validation of this assumption. Hence, only the first two design assumptions are analyzed in detail.

The validation method depends fundamentally on a mathematical reliability analysis of the system. Such a reliability analysis must include the following three probabilities of failure:

$p_h$ = Prob(one or more hardware component failures in a processor)

$p_1$ = Prob(design assumption 1 being violated)

$p_2$ = Prob(design assumption 2 being violated)

The probability $p_h$ may be obtained from a military standard 217D analysis of component failure data. This analysis method is well known and therefore is not discussed here. A processor failure rate of $10^{-5}$/hour is assumed. The probability $p_1$ is the probability of a measurement error in determining the upper bound $\rho$. This probability may be made arbitrarily small by increasing the bound and/or increasing the accuracy of the measurements (e.g., reducing $\sigma_\rho$ by using a larger sample size). The probability of failure $p_2$ arises from the stochastic nature of the communication system that is used to read another processor’s clock. This probability may also be reduced by increasing the bound $\epsilon$; however, this is done at the expense of increasing the estimated maximum clock skew $\delta$. This trade-off is discussed in detail in the section “Additional Observations About the Clock Synchronization Algorithm.”

The validation method entails the following steps: 

1. Determine the upper bound $\rho$ such that $\mathrm{Prob}(\max(\rho_i) > \rho)$ is negligible in comparison with $p_h$.

2. Determine the probability quantile needed for $\epsilon$ from a reliability analysis of the system.

3. Estimate this quantile from experimental data and use this value as the maximum read error $\epsilon$.

4. Compute the maximum clock skew from $\rho$ and $\epsilon$.

5. Determine whether this maximum clock skew value exceeds the value assumed in the system design.

Thus, $\rho$ and $\epsilon$ are chosen large enough to meet the system reliability requirements. Using these bounds, a theoretical maximum clock skew is calculated. If the maximum clock skew assumed in the design of the system is not less than the theoretical maximum value, then the system is validated.

Validation Step 1 

In this validation step, $\rho$ must be determined such that $p_1$ is small in comparison with $p_h$. This approach is desirable because $p_1$ is the probability of a measurement error rather than an intrinsic failure mode of the system. As mentioned previously, $p_1$ can be made arbitrarily small by improving the measurement technique (i.e., reducing $\sigma_\rho$).

A regression analysis of the $\Delta_{qp}(T^{(i)})$ experimental data produced the following table:

Processor pair    $i$    $\hat{\rho}_i$    $\hat{\sigma}_i$    $u_i$
1,2               1      30.02             0.2687              31.462
1,3               2      9.01              0.0245              9.143
1,4               3      35.50             0.0251              35.635
2,3               4      14.92             0.0954              15.432
2,4               5      5.48              0.0390              5.686
3,4               6      40.97             0.0851              41.427


The values of $u_i$ were calculated using $u_i = \hat{\rho}_i + t(\nu, \theta)\,\hat{\sigma}_i$, where $\theta = \sqrt[6]{1 - 10^{-7}} = 1 - 1.667 \times 10^{-8}$ and $\nu = n_s - 2 = 1998$. Thus, $p_1 = 10^{-7}$, which is small relative to $p_h$. The maximum drift rate $\rho$ is estimated by $\max(u_i)$. Thus,

$$\rho = 41.427$$
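The arithmetic of this step can be reproduced with a short Python sketch. It takes the $u_i$ column directly from the table above; the variable names are illustrative, and the Student's $t$ multiplier itself is not recomputed here.

```python
# Values of u_i copied from the regression table (one per processor pair).
u = {
    (1, 2): 31.462,
    (1, 3): 9.143,
    (1, 4): 35.635,
    (2, 3): 15.432,
    (2, 4): 5.686,
    (3, 4): 41.427,
}

alpha = 1e-7       # target probability p_1 that the bound is exceeded
n_pairs = len(u)   # n = 6 processor pairs

# Per-pair confidence level theta = (1 - alpha)^(1/n).
theta = (1.0 - alpha) ** (1.0 / n_pairs)

# The drift-rate bound is the largest per-pair upper bound.
rho_bound = max(u.values())
worst_pair = max(u, key=u.get)

if __name__ == "__main__":
    print(f"1 - theta = {1.0 - theta:.4e}")
    print(f"rho bound = {rho_bound} for processor pair {worst_pair}")
```

Taking the maximum over pairs means that a single poorly behaved pair sets the bound for the whole system.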

Validation Step 2 

In this step the required quantile for $\epsilon$ must be determined from a reliability analysis of the system. Typically, a detailed Markov model is necessary to calculate the reliability of a fault-tolerant system. However, for simplicity, it is assumed here that there is no sparing capability in the system, and thus a simple combinatorial analysis can be used to compute the reliability of the system. Since the algorithm can tolerate $m$ processor failures, the probability of $m + 1$ or more processor failures during the flight must be determined. The system is thus a 2-out-of-4 system.

The probability of a processor failure is as follows: 

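$$p = p_h + p_1 + p_2$$

Therefore, the probability of a system failure during a mission of length $T$ is

$$\begin{aligned}
p_{sys} &= 1 - \mathrm{Prob}(\text{no failures}) - \mathrm{Prob}(\text{1 failure}) \\
&= 1 - \binom{4}{0} p^0 (1 - p)^4 - \binom{4}{1} p^1 (1 - p)^3 \\
&= 6p^2 - 8p^3 + 3p^4 \\
&\approx 6p^2
\end{aligned}$$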
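The combinatorial step can be checked numerically. The Python sketch below sums the binomial terms for more than $m$ failures among $N$ processors and compares the result with the closed form and with the $6p^2$ approximation. The function name and the sample value of $p$ are assumptions made for illustration.

```python
from math import comb


def system_failure_probability(p: float, n_proc: int = 4, m: int = 1) -> float:
    """Probability that more than m of n_proc independent processors fail."""
    return sum(
        comb(n_proc, j) * p**j * (1.0 - p) ** (n_proc - j)
        for j in range(m + 1, n_proc + 1)
    )


if __name__ == "__main__":
    p = 1.29e-5  # illustrative per-processor failure probability
    exact = system_failure_probability(p)
    closed_form = 6 * p**2 - 8 * p**3 + 3 * p**4
    print(f"binomial sum : {exact:.6e}")
    print(f"closed form  : {closed_form:.6e}")
    print(f"6 p^2        : {6 * p**2:.6e}")
```

Summing the failure terms directly, rather than subtracting the success terms from 1, avoids floating-point cancellation when $p$ is very small.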

Given a reliability requirement of $p_{sys} = 10^{-9}$, a processor failure rate of $p_h = 10^{-5}$/hour, and $p_1 = 10^{-7}$, we have

$$\begin{aligned}
p_2 &= \mathrm{Prob}(\text{one or more read errors} > \epsilon \text{ occurring on a specific processor during a mission of length } T) \\
&= \sqrt{p_{sys}/6} - p_h - p_1 = 2.809 \times 10^{-6}
\end{aligned}$$

Since the bound $\epsilon$ refers to a single clock read, the number of clock reads during a mission of length $T$ must be determined in order to calculate the quantile needed for $\epsilon$.

Defining $p_\epsilon$ as

$$p_\epsilon = \mathrm{Prob}(\text{obtaining a read error} > \epsilon \text{ during a single clock read})$$

the probability $p_2$ is then easily expressed in terms of $p_\epsilon$ as follows:

$$p_2 = 1 - \binom{n}{0} p_\epsilon^0 (1 - p_\epsilon)^n$$

where

$n = (N - 1)T/R$ (i.e., the number of clock reads a specific processor makes during a mission of length $T$)

and

$T$ = mission time
$R$ = synchronization period
$N$ = number of processors in system

Using the Poisson approximation to the binomial,

$$p_2 \approx 1 - \exp(-n p_\epsilon)$$

Furthermore, by a Taylor series approximation (valid because $n p_\epsilon \ll 1$),

$$p_2 \approx n p_\epsilon = (N - 1)(T/R)\,p_\epsilon$$

Using $N = 4$, $R = 30$ sec, and $T = 10$ hours, the probability $p_\epsilon$ is determined as follows:

$$p_\epsilon = 7.805 \times 10^{-10}$$
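The reliability budget of this step can be reproduced with a few lines of Python. The sketch follows the arithmetic in the text, including its use of $p_h = 10^{-5}$ directly in the sum; the variable names are illustrative.

```python
import math

p_sys = 1e-9      # system reliability requirement
p_h = 1e-5        # hardware failure probability, as used in the text
p_1 = 1e-7        # probability that the drift-rate bound is exceeded

N = 4             # number of processors
R = 30.0          # synchronization period, seconds
T = 10 * 3600.0   # mission time, seconds (10 hours)

# Invert p_sys ~= 6 p^2 and subtract the other failure contributions.
p_2 = math.sqrt(p_sys / 6.0) - p_h - p_1

# Number of clock reads one processor makes during the mission.
n_reads = (N - 1) * T / R

# Taylor approximation p_2 ~= n * p_eps, solved for p_eps.
p_eps = p_2 / n_reads

# Cross-check against the exact expression p_2 = 1 - (1 - p_eps)^n,
# evaluated with log1p/expm1 to avoid cancellation.
p_2_exact = -math.expm1(n_reads * math.log1p(-p_eps))

if __name__ == "__main__":
    print(f"p_2     = {p_2:.4e}")
    print(f"n reads = {n_reads:.0f}")
    print(f"p_eps   = {p_eps:.4e}")
    print(f"p_2 from exact formula = {p_2_exact:.4e}")
```

The cross-check at the end indicates how little is lost by the Poisson and Taylor approximations when $n p_\epsilon$ is this small.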

This analysis assumes independence of clock read-error failures. The design proof has thus reduced the strong assumption of independent clock failure to independent communication. This analysis makes a strong case for avoiding contention-based communication protocols in a fault-tolerant architecture.

Validation Step 3 

The third step in validating the system under investigation is to estimate the $1 - p_\epsilon$ quantile of the read-error distribution. Two methods were developed in the preceding sections to obtain a histogram of the clock read errors $e_{qp}$. Figure 8 is a histogram of $|e_{qp}| = |X_{qp} - v|$ obtained from direct measurements of the one-way communication times. These data are used to illustrate the determination of $\epsilon$.

As discussed in a preceding section, the upper bound $\epsilon$ is determined using Weissman’s technique. Prior to examination of the data, $k$ was chosen to be 20 (i.e., 1 percent of the sample), as recommended by Weissman. The experimental data best support LAMBDA as the limiting distribution; however, there was no clear rejection of either of the limiting distributions. The standardized form of the Gini statistic $W_{k-1}$ applied to the $k - 1$ normalized spacings of the $k$ largest observations from the sample, and the corresponding observed significance levels for the tests, were as follows:




Limiting distribution    $W_{19}$    Observed significance level, percent
LAMBDA                   0.5262      59.2
PHI                      1.449       14.7


The inability to choose the limiting distribution with precision is of some concern here. Examining the test results for various values of $k$ provides additional insight into discerning the limiting-distribution family. In figure 9, the standardized Gini statistics for the LAMBDA and PHI tests using various values of $k$ are plotted. The tests show a consistent tendency toward selection of the LAMBDA distribution. In fact, for some values of $k$ there is strong rejection of the PHI and strong acceptance of the LAMBDA. As $k$ becomes larger, both models are eventually rejected, because Weissman’s theory applies only to the tail of a distribution. Although the additional information obtained by varying $k$ intuitively leads to a choice of the LAMBDA distribution, how to use such information has not been formalized statistically.

An alternate solution to the problem of discerning the limiting-distribution family is to calculate the $1 - p_\epsilon$ quantile from both family models and to use the more conservative value. However, the poorly fitting model sometimes gives astronomical values, leading to unacceptable answers. Another approach is to pursue additional statistical methods to determine the limiting-distribution family. Such methods are not presented here.

The remaining calculations are performed with the assumption that LAMBDA is the correct limiting distribution. Using the LAMBDA case analysis, the $1 - p_\epsilon$ quantile was estimated to be 15.383 msec. The combinatorial analysis has shown that $\epsilon$ must be at least the $1 - p_\epsilon$ quantile to meet the system reliability requirements. Using this quantile to estimate the upper bound gives

$$\epsilon = 15.383 \text{ msec}$$
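The estimator actually used is developed in an earlier section. As a rough illustration of how a quantile far beyond the data can be extrapolated from the $k$ largest observations under an exponential-type (LAMBDA) tail, the Python sketch below uses a commonly cited form of Weissman's estimator, $\hat{a} = \bar{X}_k - X_{(k)}$ and $\hat{x} = X_{(k)} + \hat{a}\ln(k/(n p))$. That form, the function name, and the synthetic data are assumptions for illustration and are not taken from the AIRLAB measurements.

```python
import math


def lambda_tail_quantile(sample, k, p_exceed):
    """Extrapolate the (1 - p_exceed) quantile from the k largest values,
    assuming an exponential-type (LAMBDA) tail."""
    n = len(sample)
    top = sorted(sample, reverse=True)[:k]   # k largest observations
    x_k = top[-1]                            # k-th largest observation
    a_hat = sum(top) / k - x_k               # scale estimate
    return x_k + a_hat * math.log(k / (n * p_exceed))


if __name__ == "__main__":
    # Synthetic sample: evenly spaced quantiles of a unit exponential.
    n = 2000
    sample = [-math.log((i + 0.5) / n) for i in range(n)]
    estimate = lambda_tail_quantile(sample, k=20, p_exceed=1e-6)
    print(f"extrapolated quantile: {estimate:.3f}")
    print(f"true quantile        : {-math.log(1e-6):.3f}")
```

The extrapolation is only as good as the assumed tail family, which is why the choice between LAMBDA and PHI matters.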

The conservative nature of this estimate can be seen by comparison with the maximum observation, 4.49 msec.

Figure 9. Results of tests for limiting-distribution family.

Validation Step 4

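The estimated values of $\rho$ and $\epsilon$ must be inserted into the theoretical expression for maximum clock skew from the design-proof theorem:

$$\delta \ge \frac{N}{N - 3m}\left\{2\epsilon + \rho\left[R + \frac{2(N - m)}{N} S\right]\right\},$$

$$\delta \ge \delta_0 + \rho R,$$

and

$$\delta \ll \epsilon/\rho$$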

The directly measured and indirectly estimated values of the system parameters are as follows:

$N = 4$
$m = 1$
$R = 30$ sec
$S = 615.334$ msec
$\epsilon = 15.383$ msec
$\rho = 41.42657$ $\mu$sec/sec

Using these values, the maximum clock skew $\delta$ can be computed as follows:

$$\begin{aligned}
\delta &= [N/(N - 3m)]\,\{2\epsilon + \rho[R + 2(N - m)S/N]\} \\
&= 123.061 + 1.65706 \times 10^{-4}\,(30000 + 3(615.334)/2) \\
&= 123.061 + 5.124 \\
&= 128.185 \text{ msec}
\end{aligned}$$

Thus, the clocks remain synchronized to within 128.185 msec with probability not less than $1 - 10^{-9}$ if the synchronization period is 30 seconds. The contribution of the second term is small relative to the first. This reveals that, in this implementation, the clocks are much more accurate than the interprocess communication subsystem.
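The skew calculation can be written as a small Python function. This is a sketch of the design-proof expression with the parameter values listed in this step; the function and argument names are illustrative. With the rounded inputs it should agree with the reported 128.185 msec to within rounding.

```python
def max_clock_skew(n_proc, m, eps, rho, period, sync_time):
    """Maximum skew from the design-proof expression:
        delta = [N/(N - 3m)] * {2*eps + rho*[R + 2*(N - m)*S/N]}
    eps, period, and sync_time share one time unit; rho is dimensionless."""
    if n_proc <= 3 * m:
        raise ValueError("the expression requires N > 3m")
    factor = n_proc / (n_proc - 3 * m)
    drift_window = period + 2 * (n_proc - m) * sync_time / n_proc
    return factor * (2 * eps + rho * drift_window)


if __name__ == "__main__":
    delta = max_clock_skew(
        n_proc=4,
        m=1,
        eps=15.383,          # msec
        rho=41.42657e-6,     # 41.42657 usec/sec as a dimensionless rate
        period=30_000.0,     # 30 sec expressed in msec
        sync_time=615.334,   # msec
    )
    print(f"maximum clock skew: {delta:.3f} msec")
```

Keeping every time quantity in milliseconds and the drift rate dimensionless avoids unit errors when the two terms are added.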

Validation Step 5 

As discussed previously, the communication subsystem of a real-time system depends critically on synchronization being maintained within a certain bound. If the calculated skew is less than the bound used in the design of the communication subsystem, then the synchronization system has been validated. Otherwise, the real-time system must be redesigned if the reliability requirements are to be met. This may be accomplished by either slowing down the communications system (i.e., waiting longer for interprocess data) or making improvements to reduce $\rho$ and/or $\epsilon$. The trade-off between performance and reliability is explored in detail in the next section.


These validation steps can be reversed to compute system reliability, given a specific design value for the maximum clock skew $\delta$. Essentially, $\rho$ is estimated as described previously such that $p_1 = 10^{-7}$. Then $\epsilon$ is computed from the formula of the theorem using these values of $\rho$ and $\delta$. Next, the probability that $\epsilon$ is exceeded, $p_2$, can be computed, and hence $p_{sys}$ can be determined. Thus, the system reliability is determined, given a specific design choice for the maximum clock skew.
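The last part of this reverse calculation can be sketched in Python. The sketch assumes that the single-read exceedance probability $p_\epsilon$ for the chosen $\epsilon$ has already been obtained from the estimated read-error tail, which is not reproduced here. The function name and default values are illustrative, and the closed form applies only to the four-processor, single-fault case treated in the text.

```python
N_PROC = 4  # the closed form below is specific to N = 4, m = 1


def system_failure_from_read_error(p_eps, p_h=1e-5, p_1=1e-7,
                                   period_s=30.0, mission_s=36_000.0):
    """Reverse calculation: from the single-read exceedance probability
    p_eps to the system failure probability p_sys."""
    n_reads = (N_PROC - 1) * mission_s / period_s
    p_2 = n_reads * p_eps          # Taylor approximation p_2 ~= n * p_eps
    p = p_h + p_1 + p_2            # per-processor failure probability
    return 6 * p**2 - 8 * p**3 + 3 * p**4


if __name__ == "__main__":
    for p_eps in (1e-11, 7.805e-10, 1e-8):
        p_sys = system_failure_from_read_error(p_eps)
        print(f"p_eps={p_eps:.3e} -> p_sys={p_sys:.3e}")
```

Feeding in the forward-direction value of $p_\epsilon$ should recover a system failure probability close to the original requirement, which is a useful consistency check.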

Additional Observations About the Clock Synchronization Algorithm

The synchronization algorithm is executed periodically and utilizes CPU time during each execution. The major component of the execution time is the time required to read the clock of every other processor in the system. In SIFT, each processor’s clock value is broadcast during a window of time allocated to it. There are $N$ such windows, one for each processor in the system. All other processors wait during this window to receive the broadcast clock value. To accommodate the worst-case situation, each window must be at least $B + \delta$ seconds long, where $B$ is the maximum broadcast time (i.e., $v + \epsilon$) and $\delta$ is the maximum clock skew. Hence, the execution time for the clock synchronization algorithm can be represented as

$$S = N(\delta + B) + K$$

where $K$ is the time needed to compute the correction factor and to correct the clock. The execution time of the synchronization task $S$ contributes to the inability to synchronize perfectly, because the clocks continue to drift apart while the synchronization task is executing. Thus, the equations defining $\delta$ and $S$ are coupled as follows:

$$S = S(N, \delta, B, K)$$

$$\delta = \delta(N, m, \epsilon, \rho, R, S)$$

Also, $\epsilon$ is dependent on the system reliability requirement $p_{sys}$ and the synchronization period $R$. Therefore,

$$\epsilon = \epsilon(p_{sys}, R)$$
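One way to handle the coupling between $\delta$ and $S$ numerically is a fixed-point iteration, sketched below in Python with $\epsilon$ held fixed (its dependence on $p_{sys}$ and $R$ is not modeled). The function name is illustrative, and the values $B = 20$ msec and $K = 22.594$ msec are hypothetical, chosen only so that $S$ lands near the 615 msec used earlier; they are not measurements from the paper.

```python
def solve_skew_and_sync_time(n_proc, m, eps, rho, period, broadcast, compute,
                             tol=1e-12, max_iter=100):
    """Fixed-point iteration on the coupled equations
        S     = N*(delta + B) + K
        delta = [N/(N - 3m)] * {2*eps + rho*[R + 2*(N - m)*S/N]}
    with eps held fixed. Times share one unit; rho is dimensionless."""
    factor = n_proc / (n_proc - 3 * m)
    delta = factor * 2 * eps              # starting guess: ignore drift
    for _ in range(max_iter):
        sync_time = n_proc * (delta + broadcast) + compute
        drift_window = period + 2 * (n_proc - m) * sync_time / n_proc
        new_delta = factor * (2 * eps + rho * drift_window)
        converged = abs(new_delta - delta) < tol
        delta = new_delta
        if converged:
            break
    sync_time = n_proc * (delta + broadcast) + compute
    return delta, sync_time


if __name__ == "__main__":
    delta, sync_time = solve_skew_and_sync_time(
        n_proc=4, m=1, eps=15.383, rho=41.42657e-6, period=30_000.0,
        broadcast=20.0,    # hypothetical B, msec
        compute=22.594,    # hypothetical K, msec
    )
    print(f"delta = {delta:.3f} msec, S = {sync_time:.3f} msec")
```

Because the drift rate is tiny, each pass changes $\delta$ by only a small fraction of the previous change, so the iteration settles within a few steps. Since both equations are linear in $\delta$ and $S$ for fixed $\epsilon$, a direct algebraic solution would also work.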

Although an algebraic solution is tedious, $S$ can easily be numerically determined as a function of the synchronization period $R$ or the fraction of time spent synchronizing $S/R$. It is more informative, however, to relate the experimental results to the performance of a hypothetical real-time communication system. A real-time communication system relies on the synchronization task, as shown in figure 1. The minimum communication time is $B + \delta$, since the system must wait at least this long to ensure that the data value has arrived before accessing it. Since $B = v + \epsilon$, the minimum communication time can be expressed as $(v + \epsilon + \delta)$. This minimum communication time represents the impact of the synchronization system on the performance of the real-time system. Thus, the performance of the real-time communication system is a function of the system reliability. This performance-reliability trade-off is illustrated in figure 10. As expected, as the reliability requirement is relaxed, the performance of the system increases. Also, the performance is a function of the synchronization period $R$ or the fraction of time spent synchronizing $S/R$. This performance-overhead trade-off is illustrated in figures 11 and 12.

Concluding Remarks 

The validation method presented in this paper is an exploitation of the precision with which the formal proof method reduces the complexity of the system to verifiable axioms about the system behavior. Although the proof process itself is very costly, it is extremely valuable when attempting to validate the crucial synchronization subsystem. The validation method introduced in this paper is essentially to: (1) perform a design proof of the synchronization algorithm under the assumption of low-level system behavior axioms, (2) perform a code proof of the synchronization code, and (3) experimentally estimate the probability that the system behavior axioms will be violated and include these failure probabilities in a reliability analysis of the system. The Software Implemented Fault Tolerance (SIFT) synchronization design proof provided the basis for step (1). Step (2) has not yet been attempted, but will be performed under NASA contract NAS1-17067. Step (3) is explored in detail in this paper. Although the validation theory is not directly applied to the data obtained from the SIFT hardware, the validation theory is applied to data obtained from an experimental system in the Langley Avionics Integration Research Laboratory (AIRLAB). The purpose of this paper is to define a validation method rather than specifically validate the SIFT synchronization subsystem. After the development of the SIFT data-retrieval system in early 1984, this theory will be applied to the SIFT hardware.

The design proof process reduces the performance of the clock synchronization algorithm to an algebraic expression of certain system parameters. These parameters, which are defined by formal axioms, represent worst-case bounds on system performance. By simple combinatorial analysis, the system reliability requirements can be translated to reliability requirements on these bounds. Because the estimation of a bound of a random variable is required, statistical methods applicable to the tail of a distribution are employed. The estimated parameters are substituted into the algebraic expression which calculates the worst-case performance of the synchronization system. This estimated worst-case performance (given the specified reliability constraints) is compared against the system design value. This validation process thus yields estimates of both performance and reliability.

Langley Research Center 

National Aeronautics and Space Administration 
Hampton, VA 23665 
July 18, 1984

Appendix

Derivation of a Simplified $e_{qp}$ Formula

The synchronization theory is expressed in terms of real time. For example, the $\epsilon$ bound is defined in terms of the $r(T)$ function. Unfortunately, the state variables of a synchronization system are maintained in terms of clock times. In this section, the difference between the actual $e_{qp}$ and one derived from clock measurements is shown to be negligible.

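Lemma 1: If clock $r$ is good, then $r(T_0 + \Delta) \approx r(T_0) + \Delta$ for small $\Delta$.

Proof: From the definition of a good clock,

$$\left(1 - \frac{\rho}{2}\right) < \frac{dr}{dT} < \left(1 + \frac{\rho}{2}\right)$$

By integration,

$$\int_{T_0}^{T_0 + \Delta} \left(1 - \frac{\rho}{2}\right) dT < \int_{T_0}^{T_0 + \Delta} \frac{dr}{dT}\, dT < \int_{T_0}^{T_0 + \Delta} \left(1 + \frac{\rho}{2}\right) dT$$

Evaluating these integrals yields

$$(1 - \rho/2)\,\Delta < r(T_0 + \Delta) - r(T_0) < (1 + \rho/2)\,\Delta$$

or

$$|r(T_0 + \Delta) - [r(T_0) + \Delta]| < (\rho/2)\,\Delta$$

Thus, if $(\rho/2)\Delta$ is negligible ($\rho$ is typically on the order of $10^{-5}$, and $\Delta$ is typically on the order of $10^{-6}$), then

$$r(T_0 + \Delta) \approx r(T_0) + \Delta$$

Lemma 2: If clock $C$ is good, then $C(y + \Delta) \approx C(y) + \Delta$ for small $\Delta$.

Proof: By the previous lemma, $r(T_0 + \Delta) \approx r(T_0) + \Delta$. Letting $x = r(T_0 + \Delta)$ and $y = r(T_0)$, then

$$C(x) = T_0 + \Delta$$

$$C(y) = T_0$$

and

$$x \approx y + \Delta$$

Thus,

$$C(x) - \Delta = T_0 = C(y), \quad \text{and} \quad C(x) \approx C(y + \Delta)$$

Hence,

$$C(y + \Delta) \approx C(y) + \Delta$$

Using these lemmas, it is easy to demonstrate that we can accurately determine $e_{qp}$ from measured clock data:

Theorem: For good clocks $p$ and $q$: $C_p(y) - C_q(y) + e_{qp} \approx \Delta_{qp}$.

Proof: By definition,

$$e_{qp} = r_p(T_0 + \Delta_{qp}) - r_q(T_0)$$

Letting $x = r_p(T_0 + \Delta_{qp})$ and $y = r_q(T_0)$ yields

$$C_p(x) - \Delta_{qp} = T_0 = C_q(y)$$

Since $e_{qp} = x - y$,

$$C_p(e_{qp} + y) - \Delta_{qp} = C_q(y)$$

From lemma 2, we conclude that

$$C_p(y) - C_q(y) + e_{qp} \approx \Delta_{qp}$$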
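The theorem can be illustrated numerically with a simple linear clock model. The Python sketch below is an assumption-laden example, not part of the original derivation: the clocks drift at constant rates within the good-clock bound, and all numerical values and names are hypothetical.

```python
def make_clock(drift, offset):
    """Linear clock model. r maps clock time to real time; c is its inverse,
    mapping real time back to clock time."""
    def r(clock_time):
        return (1.0 + drift) * clock_time + offset

    def c(real_time):
        return (real_time - offset) / (1.0 + drift)

    return r, c


def theorem_check(r_p, c_p, r_q, c_q, t0, delta_qp):
    """Return (e_qp, residual), where the residual is
    C_p(y) - C_q(y) + e_qp - Delta_qp evaluated at y = r_q(T0)."""
    e_qp = r_p(t0 + delta_qp) - r_q(t0)
    y = r_q(t0)
    residual = c_p(y) - c_q(y) + e_qp - delta_qp
    return e_qp, residual


if __name__ == "__main__":
    rho = 5e-5  # hypothetical drift bound; each clock drifts by less than rho/2
    r_p, c_p = make_clock(drift=2.0e-5, offset=0.003)
    r_q, c_q = make_clock(drift=-1.5e-5, offset=0.0)
    e_qp, residual = theorem_check(r_p, c_p, r_q, c_q, t0=100.0, delta_qp=0.002)
    print(f"e_qp = {e_qp:.6e}, residual = {residual:.3e}")
```

For this model the residual scales with the product of the read error and the drift rate, which is the sense in which the difference is negligible.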